ABSTRACT

Conference paper assignment, i.e., the task of assigning paper
submissions to reviewers, presents multi-faceted issues
for recommender systems research. Besides the traditional 
goal of predicting 'who likes what?', a conference manage- 
ment system must take into account aspects such as: re- 
viewer capacity constraints, adequate numbers of reviews for 
papers, expertise modeling, conflicts of interest, and an over- 
all distribution of assignments that balances reviewer pref- 
erences with conference objectives. Among these, issues of 
modeling preferences and tastes in reviewing have tradition- 
ally been studied separately from the optimization of paper- 
reviewer assignment. In this paper, we present an integrated 
study of both these aspects. First, due to the paucity of data 
per reviewer or per paper (relative to other recommender 
systems applications) we show how we can integrate multi- 
ple sources of information to learn paper-reviewer preference 
models. Second, our models are evaluated not just in terms 
of prediction accuracy but in terms of the end-assignment 
quality. Using a linear programming-based assignment opti- 
mization formulation, we show how our approach better ex- 
plores the space of unsupplied assignments to maximize the 
overall affinities of papers assigned to reviewers. We demon- 
strate our results on real reviewer preference data from the 
IEEE ICDM 2007 conference. 
1. INTRODUCTION

Modern conferences, especially in areas such as data min- 
ing/machine learning (KDD; ICDM; ICML; NIPS) and da- 
tabases/web (VLDB; SIGMOD; WWW), are beset with ex- 
cessively high numbers of paper submissions. Assigning 
these papers to appropriate reviewers in the program com- 
mittee (which can constitute a few hundred members) is a 
herculean task and hence motivates the use of recommender 
systems. 

Besides the traditional goal of predicting 'who likes what?',
a conference management system must take into account
aspects such as: reviewer capacity constraints, adequate
numbers of reviews for papers, expertise modeling, conflicts of
interest, and an overall distribution of assignments that
balances reviewer preferences with conference objectives. Among
these, issues of modeling preferences, expertise, and tastes
in reviewing have traditionally been studied separately from
the optimization of paper-reviewer assignment. The former
has been the subject of much academic research (see
Section 2.1), while the latter is emphasized by commercial
software, such as EasyChair, CyberChair, and Microsoft's CMS,
which aim to automate the management of the conference
reviewing process.

We investigate the conference paper assignment problem 
(CPAP) through the lens of recommender systems research. 
There are three key differences between traditional recommender
systems research and CPAP. First, in a traditional
recommender, recommendations that meet the needs
of one user do not affect the satisfaction of other users. In 
CPAP, on the other hand, multiple users (reviewers) are bid- 
ding to review the same papers and hence there is the pos- 
sibility of one user's recommendations (assignments) affect- 
ing the satisfaction levels (negatively) of other users. Hence 
the design of reviewer preference models must be posed and 
studied in an overall optimization framework. 

Second, in a conventional recommender, the goal is often 
to recommend new entities that are likely to be of interest, 
whereas in CPAP, the goal is to ensure that reviewers are 
predominantly assigned their (most) preferred papers. Nev- 
ertheless, preference modeling is still crucial because it gives 
the assignment algorithm some degree of latitude in aiming 
to satisfy multiple users. 

Finally, recommender systems are used to working with 
sparse data but the amount of 'signal' available to model 
preferences in the CPAP domain is exceedingly small; hence 
we must integrate multiple sources of information to build 
strong preference models. 

In this paper, we present the first integrated study of both 
modeling reviewing preferences and optimizing assignments 
for conference management. Our key contributions can be 
summarized as follows. 

1. Due to the paucity of data per reviewer or per pa- 
per (relative to other recommender systems applica- 
tions) we show how we can integrate information about 
publication subject categories, contents of paper ab- 
stracts, and co-authorship information to learn im- 
proved paper-reviewer preference models. 

2. We evaluate our models not just in terms of prediction
accuracy but in terms of the end-assignment quality.
Using a linear programming-based assignment
optimization formulation, we show how our approach
better explores the space of unsupplied assignments to
maximize the overall affinities of papers assigned to
reviewers.

3. We demonstrate the effectiveness of our approach on 
actual reviewing preference data in the context of a 
real-life conference, namely the IEEE ICDM'07
conference [19].

2. RELATED RESEARCH 

Any conference management system must contend with 
two main issues: how to model affinities or preferences be- 
tween papers and reviewers, and how to use these affinities 
to make and/or optimize assignments. For the former issue, 
many conferences have an explicit 'bidding' phase and use 
data collected in this phase as the affinity matrix. While 
many conferences use these bids as-is, we will demonstrate 
how they can be used as the starting point to build improved 
preference models. Approaches to solve the latter issue have 
traditionally been considered orthogonal to the problem of 
preference modeling but, as we demonstrate later, better 
preference modeling leads to improvements in this phase as 
well. 

2.1 Modeling Affinities, Preferences, and Ex- 
pertise 

The sparsity of reviewer-paper bidding data has led some 
researchers, e.g., Rigaux [20], to explore the use of
collaborative filtering techniques [6], [11] to 'grow' the given
bids. The underlying assumption is that reviewers who bid
similarly on a number of the same papers are likely to have
similar preferences for other papers. Basu et al. [1] use the
relational
WHIRL system to integrate similarity scores from disparate 
data sources to identify most relevant (paper, reviewer) com- 
binations. They do not, however, attempt to satisfy per- 
paper or per-reviewer constraints, and the contributions of 
different sources are considered equivalent to each other. 
Popescul et al Ts' present a way to combine content-based 
and collaborative recommendations using a three-way aspect 
model. The GRAPE system [14] prefers topical information
over supplied reviewer bids or preferences, but does use the
preferences as a secondary means of modeling. The rationale
is the view that topical data more accurately predicts
the degree of expertise present for a reviewer-paper match.
Since the distribution of reviewers and papers over topics
is unpredictable (sometimes leaving too many or too few
reviewers for a given cluster of papers), the preference
information is used for tuning or smoothing out the wrinkles in
the topic-based assignments.

A problem faced by most expertise modeling approaches 
is identifying which topics are covered in papers. Early ef- 
forts in this area focused mainly on paper abstracts, and 
topical expertise was determined through common informa- 
tion retrieval methods involving keywords. For example, 
Dumais and Nelson [H] match papers to reviewers using La- 
tent Semantic Indexing (LSI) trained on reviewer-supplied 
abstracts. Yarowsky and Florian [27] extended this idea by 
using a similar vector space model with a naive Bayes clas- 
sifier on work previously published by each reviewer. 

More recently, Wei & Croft [26] describe a topic-based
model using a language model with Dirichlet smoothing.
An excellent example of topic-based models is the
Author-Persona-Topic (APT) model by Mimno & McCallum [16].
The APT model contains a number of features designed to
better capture the reality of the relationship between
conference reviewers and papers. The idea is that an author may
study and write about several distinct topics; by clustering
papers from each of these topics into a separate persona for
an author, the author's ranking for a given topic need not
be diluted by his or her writings on a different topic.

2.2 Optimizing Assignments

Given preference data, either explicitly gathered or
computationally modeled, the actual task of making assignments
can be viewed as bipartite matching. The classical approach
to bipartite matching is given by the Hungarian Algorithm
described by Kuhn [Ts]; it provides a solution for the
simplest cases of this family of problems (applicable when the
number of reviewers equals the number of papers). Various
refinements have been made to this algorithm over the
years, such as one by Hopcroft and Karp. A number
of contemporary assignment systems take this approach,
including GRAPE [14]. For practical reasons, it is useful to
restrict the number of reviews per reviewer and per paper;
a constraint-based linear program, e.g., work by Taylor [25],
is a natural approach.

Another approach to CPAP uses reasoning from the much
more general minimal cost network flow problems studied in
dynamics and operations research. Many such related
problems (known collectively as extended Generalized
Assignment Problems [5] or GAP) of assigning a limited number
of resources to certain tasks exist in diverse fields. In the
network flow diagram of this general assignment problem,
resources (in our case, reviewers) are represented by source
nodes with a certain supply (number of reviews allowed per
reviewer), while tasks (each paper to be reviewed) are sink
nodes with a demand (the number of times each paper must
be reviewed). For specific approaches, see (tI



3. MODELS OF REVIEW PREFERENCES 

We adapt recommendation techniques for predicting un- 
known reviewer-paper preferences. Naturally, reviewers as- 
sume the role of "users" in traditional recommender systems, 
while papers take the role reserved for "products". Our goal
is to exploit a variety of available information (see Fig. 1)
in order to get better estimates of those unknown prefer- 
ences. This, in turn, will allow the assignment algorithm to 
find better matches between reviewers and papers. First, we 
introduce some essential conventions. 

3.1 Notation and Dataset Description 

We are given ratings (henceforth, interchangeable with 
preferences) about m reviewers and n papers. We reserve 
special indexing letters for distinguishing reviewers from
papers: u, v for reviewers, and i, j for papers. A rating $r_{ui}$
indicates the preference by reviewer u of paper i, where high
values mean stronger preferences. Usually the vast majority 
of ratings are unknown. 

As a concrete example, the dataset utilized in this pa- 
per comes from the Seventh IEEE International Conference 
on Data Mining (ICDM'07) held in Omaha, NE, USA (uti- 
lized here with permission). The originally supplied matrix 
is sparse: 529 papers, 203 reviewers, and only 6267 bids. 
This means that a reviewer rates about 31 papers on average,
while a paper receives fewer than 12 ratings on average.
Each rating reflects a bid a reviewer put on a paper, with
numerical values between 1 and 4 indicating preferences as
follows: 4 = "High", 3 = "OK", 2 = "Low", and 1 = "No".

Figure 1: Data used in this paper for building paper-reviewer preference models.
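The averages quoted for the dataset follow directly from the three reported counts. The short check below is only a sanity calculation; the variable names are ours and not part of the original data release.

```python
# Counts reported for the ICDM'07 bidding data.
n_papers, n_reviewers, n_bids = 529, 203, 6267

bids_per_reviewer = n_bids / n_reviewers      # roughly 31
bids_per_paper = n_bids / n_papers            # just under 12
density = n_bids / (n_papers * n_reviewers)   # share of matrix cells holding a bid

print(f"{bids_per_reviewer:.1f} bids per reviewer")
print(f"{bids_per_paper:.1f} bids per paper")
print(f"{100 * density:.1f}% of reviewer-paper pairs have a bid")
```

Fewer than one reviewer-paper pair in fifteen carries a bid, which is the sparsity the models in this section try to compensate for.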

We distinguish predicted ratings from known ones by using
the notation $\hat{r}_{ui}$ for the predicted value of $r_{ui}$. To
evaluate the models we split the dataset into a train set, which
contains about 90% of the preferences (randomly chosen),
and a test set, which contains the remaining preferences.
Consequently, our models learn from the train set and assign
values to $\hat{r}_{ui}$ for all (u, i)-pairs in the test set. Results
from these runs are averaged over 100 iterations of
training-test data splits.

The quality of the results on a specific test set (TestSet)
is measured by their root mean squared error (RMSE):

$$\mathrm{RMSE} = \sqrt{\sum_{(u,i) \in \mathit{TestSet}} (\hat{r}_{ui} - r_{ui})^2 \,/\, |\mathit{TestSet}|}$$

The overall accuracy of the model is taken as the mean RMSE
over the 100 randomly generated test sets. The reason for
using such a randomization is the small size of our dataset,
which makes each individual test set relatively small.
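The evaluation protocol can be sketched as follows. This is a minimal illustration, assuming ratings are stored as a dictionary keyed by (reviewer, paper); the function names, the toy bids, and the global-mean predictor are hypothetical stand-ins rather than the code used for the reported numbers.

```python
import math
import random

def rmse(predicted, test_set):
    """RMSE of `predicted` over the (reviewer, paper) pairs in `test_set`."""
    squared_errors = [(predicted[pair] - r) ** 2 for pair, r in test_set.items()]
    return math.sqrt(sum(squared_errors) / len(test_set))

def random_split(ratings, test_fraction=0.1, rng=random):
    """Hold out about `test_fraction` of the known ratings as a test set."""
    pairs = list(ratings)
    rng.shuffle(pairs)
    n_test = max(1, round(test_fraction * len(pairs)))
    test = {pair: ratings[pair] for pair in pairs[:n_test]}
    train = {pair: ratings[pair] for pair in pairs[n_test:]}
    return train, test

def mean_rmse(ratings, fit_predict, n_splits=100, seed=0):
    """Average test RMSE of `fit_predict(train, test_pairs)` over random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        train, test = random_split(ratings, rng=rng)
        predicted = fit_predict(train, list(test))
        scores.append(rmse(predicted, test))
    return sum(scores) / n_splits

# Hypothetical toy bids on the 1-4 scale and a trivial global-mean predictor.
toy = {(u, i): 1 + (u + i) % 4 for u in range(6) for i in range(5)}

def global_mean(train, test_pairs):
    mu = sum(train.values()) / len(train)
    return {pair: mu for pair in test_pairs}

print(mean_rmse(toy, global_mean, n_splits=100))
```

Any of the models in this section can be plugged in as `fit_predict`; averaging over many splits smooths out the noise of each small test set.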

We hasten to add that we do not advocate the myopic 
view of RMSE as the primary criterion for recommender 
systems evaluation. We use it in this section primarily due 
to its convenience for constructing direct optimizers. In the 
next section we will evaluate performance according to crite- 
ria more natural to the paper assignment problem. We also 
note that small improvements in overall RMSE will typi- 
cally translate into substantial improvements in bottom-line 
performance for predicting paper-reviewer preferences. 

In the following, we gradually expand the prediction model, 
by introducing into it a growing set of features. 

3.2 Baseline model 

Much of the variability in the data is explained by global 
effects, which can be reviewer- or paper-specific. It is im- 
portant to capture this variability by a separate component, 
thus letting the more involved models deal only with genuine 
reviewer-paper interactions. We model these global effects 
through: 

$$\hat{r}_{ui} = \mu + b_u + b_i \tag{1}$$

The constant $\mu$ indicates a global bias in the data, which
is taken to be the overall mean rating. The parameter $b_u$
captures reviewer-specific bias, accounting for the fact that
different reviewers use different rating scales. Finally, the
paper bias, $b_i$, accounts for the fact that certain papers tend
to attract higher (or lower) bids than others.

We learn optimal values for $b_u$ ($u = 1, \ldots, m$) and $b_i$
($i = 1, \ldots, n$) by minimizing the associated squared error
function (or, equivalently, the train RMSE):

$$\min \sum_{(u,i) \in \mathit{TrainSet}} (r_{ui} - \mu - b_u - b_i)^2 + \lambda_1 b_u^2 + \lambda_2 b_i^2$$

The regularizing term, i.e., $\lambda_1 b_u^2 + \lambda_2 b_i^2$, avoids
overfitting by penalizing the magnitudes of the parameters. We
set the values of the constants $\lambda_1$ and $\lambda_2$ by cross
validation. Learning is done by stochastic gradient descent
(alternatively, any least squares solver could be used here).
The resulting average test RMSE is 0.6286.
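A minimal sketch of fitting the baseline of Eq. (1) by stochastic gradient descent is given below. The learning rate, epoch count, and regularization constants are placeholder values chosen for the toy data, not the cross-validated settings used in our experiments, and `fit_baseline` is a name introduced here for illustration.

```python
def fit_baseline(train, lam1=0.05, lam2=0.05, lr=0.01, n_epochs=200):
    """Fit mu, b_u, b_i of Eq. (1) by SGD on the regularized squared error."""
    mu = sum(train.values()) / len(train)      # global mean rating
    b_u = {u: 0.0 for u, _ in train}           # reviewer biases
    b_i = {i: 0.0 for _, i in train}           # paper biases
    for _ in range(n_epochs):
        for (u, i), r in train.items():
            err = r - (mu + b_u[u] + b_i[i])
            # Step along the negative gradient of err^2 + lam1*b_u^2 + lam2*b_i^2.
            b_u[u] += lr * (err - lam1 * b_u[u])
            b_i[i] += lr * (err - lam2 * b_i[i])
    return mu, b_u, b_i

# Hypothetical toy bids: one generous and one strict reviewer.
train = {("gen", "p1"): 4, ("gen", "p2"): 4, ("gen", "p3"): 3,
         ("strict", "p1"): 2, ("strict", "p2"): 1, ("strict", "p3"): 1}
mu, b_u, b_i = fit_baseline(train)
print(mu, b_u, b_i)
```

On this toy data the generous reviewer ends up with a positive bias and the strict reviewer with a negative one, which is the reviewer effect discussed next.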

A separate analysis of each of the two biases shows the
reviewer effect ($\mu + b_u$, with RMSE 0.6336) to be much more
significant than the paper bias ($\mu + b_i$, RMSE 1.2943) in
reducing the error. This indicates a tendency of reviewers
to concentrate all ratings near their mean ratings, which is
supported by examination of the data.

While the baseline model could explain much of the data
variability, as evidenced by its relatively low associated RMSE,
it is useless for making actual assignments. After all, it gives
all reviewers exactly the same order of paper preferences,
since papers are ranked by $b_i$ alone.
Thus, we are really after the remaining unexplained vari- 
ability, where reviewer-specific preferences are getting ex- 
pressed. Uncovering these preferences is the subject of the 
next subsections. 

3.3 A factor model 

Latent factor models comprise a common approach to col- 
laborative filtering with the goal to uncover latent features 
that explain observed ratings; examples include pLSA [9], 
neural networks [22], and Latent Dirichlet Allocation [3].
We will focus on models that are induced by factorization 
of the reviewer-paper ratings matrix, which recently have 
gained popularity [ij [T2j [TT] [21] [24] , thanks to their attrac- 
tive accuracy and scalability. 

The premise of such models is that both reviewers and
papers can be characterized as vectors in a common $f$-D
space. The interaction between reviewers and papers is
modeled by inner products in that space. Together with the
non-interaction signal covered in the previous subsection, a
rating is predicted by the rule:

$$\hat{r}_{ui} = \mu + b_u + b_i + p_u^T q_i \tag{2}$$

Here, $p_u \in \mathbb{R}^f$ and $q_i \in \mathbb{R}^f$ are the factor vectors
of reviewer u and paper i, respectively. These are learnt by
minimizing the associated squared error function, using
stochastic gradient descent. The resulting average test RMSE
decreases slowly as the dimensionality of the latent
factor space increases. E.g., for $f = 50$ it is 0.6240, and for
$f = 100$ it is 0.6234. Henceforth, we use $f = 100$.
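The factor model of Eq. (2) can be trained with the same kind of SGD loop as the baseline. The sketch below is a toy-scale illustration; the single regularization constant, the learning rate, the initialization, and the tiny number of factors are our assumptions, not the settings behind the reported RMSE values.

```python
import random

def fit_factor_model(train, f=2, lam=0.05, lr=0.02, n_epochs=500, seed=0):
    """SGD for Eq. (2): r_hat = mu + b_u + b_i + p_u . q_i (toy-scale sketch)."""
    rng = random.Random(seed)
    mu = sum(train.values()) / len(train)
    b_u = {u: 0.0 for u, _ in train}
    b_i = {i: 0.0 for _, i in train}
    # Small random factor vectors so that SGD can move away from zero.
    p = {u: [rng.gauss(0, 0.1) for _ in range(f)] for u in b_u}
    q = {i: [rng.gauss(0, 0.1) for _ in range(f)] for i in b_i}

    def predict(u, i):
        return mu + b_u[u] + b_i[i] + sum(a * b for a, b in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for (u, i), r in train.items():
            err = r - predict(u, i)
            b_u[u] += lr * (err - lam * b_u[u])
            b_i[i] += lr * (err - lam * b_i[i])
            for k in range(f):
                p_uk, q_ik = p[u][k], q[i][k]
                p[u][k] += lr * (err * q_ik - lam * p_uk)
                q[i][k] += lr * (err * p_uk - lam * q_ik)
    return predict

# Hypothetical bids where two reviewers have opposite tastes; biases alone
# cannot separate them, but the inner-product term can.
train = {("u1", "p1"): 4, ("u1", "p2"): 1, ("u2", "p1"): 1, ("u2", "p2"): 4}
predict = fit_factor_model(train)
print(predict("u1", "p1"), predict("u1", "p2"))
```

In this example every reviewer and paper has the same mean rating, so the baseline predicts the global mean everywhere; only the latent factors recover the reviewer-specific ordering.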

3.4 Subject categories 

While latent factor models automatically infer suitable 
categories, much can be learnt from known categories attributed
to both papers and reviewers. In a typical conference sub- 
mission process, authors are requested to denote primary 
and secondary categories appropriate for their papers. Like- 
wise, reviewers are asked to indicate their interest along the 
same categories. It would be desirable to match reviewers 
with papers lying within their area of expertise. 



More specifically, for our dataset, which contains a number
of predefined categories judged relevant for ICDM'07
(see Table 1), the entered matching between paper i and
category c is denoted by:

$$\sigma_{ic} = \begin{cases} 1 & c \in \mathrm{primary}(i) \\ 0.5 & c \in \mathrm{secondary}(i) \\ 0 & \text{otherwise} \end{cases}$$

The value assignment (1 for "primary", 0.5 for "secondary")
is derived by cross validation and is quite intuitive.
Similarly, we use the following for matching reviewers with their
desired categories:

$$\theta_{uc} = \begin{cases} 1 & c \in \mathrm{interest}(u) \\ -1 & c \in \mathrm{conflict}(u) \\ 0 & \text{otherwise} \end{cases}$$

Notice that in our dataset, reviewers could enter negative
interest in certain categories, with which they have a "conflict
of interest".

This leads to a model, which measures the interaction 
between reviewers and papers based on the association of 
the respective entered categories, leading to: 



$$\hat{r}_{ui} = \mu + b_u + b_i + \sum_c \sigma_{ic} \theta_{uc} w_c \tag{3}$$



The weights $w_c$ indicate the significance of each category
in linking a reviewer to a paper. Those are learnt
automatically from the data by minimizing the squared error on
the train set. It is plausible that, e.g., a mutual interest in
some category A will strongly link a reviewer to a paper,
while a mutual interest in another category B has less
influence on the choice of papers. For a concrete example, refer
to Table 1, which shows the categories in our dataset sorted
by their respective $w_c$ values. We observe differences of
orders of magnitude in the ability of different categories to
correctly predict associations of reviewers to papers. Note 
in particular that there is no obvious monotonic relationship 
between the weight imputed to categories and the number 
of papers/reviewers associated with the category. 

The resulting average test RMSE of the model is 0.6243.
This can be improved by integrating with the latent factor
model, yielding:

$$\hat{r}_{ui} = \mu + b_u + b_i + p_u^T q_i + \sum_c \sigma_{ic} \theta_{uc} w_c \tag{4}$$

The RMSE here is 0.6197. 
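The category term shared by Eqs. (3) and (4) is a weighted sum over categories. The sketch below assumes a paper is described by its primary and secondary category sets and a reviewer by interest and conflict sets, with a conflict scored as -1; the category names and weights are invented for illustration.

```python
def sigma(paper, c):
    """Paper-category match: 1 for primary, 0.5 for secondary, 0 otherwise."""
    if c in paper["primary"]:
        return 1.0
    if c in paper["secondary"]:
        return 0.5
    return 0.0

def theta(reviewer, c):
    """Reviewer-category match: 1 for interest, -1 for conflict, 0 otherwise."""
    if c in reviewer["interest"]:
        return 1.0
    if c in reviewer["conflict"]:
        return -1.0
    return 0.0

def category_term(paper, reviewer, w):
    """Sum over categories c of sigma_ic * theta_uc * w_c."""
    return sum(sigma(paper, c) * theta(reviewer, c) * w_c for c, w_c in w.items())

# Hypothetical categories and learnt weights.
w = {"clustering": 0.8, "text mining": 0.2, "graph mining": 0.5}
paper = {"primary": {"clustering"}, "secondary": {"text mining"}}
reviewer = {"interest": {"clustering"}, "conflict": {"text mining"}}

# 1 * 1 * 0.8 from the shared primary interest, 0.5 * (-1) * 0.2 from the conflict.
print(category_term(paper, reviewer, w))
```

Only categories that are marked on both the paper and the reviewer contribute, so the learnt weight of a category matters only when it actually links the two.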

3.5 Paper-paper similarities 

We inject paper-paper similarities into our models in a
way reminiscent of item-item recommenders [23]. The building
blocks here are similarity values $s_{ij}$, which measure the
similarity of paper i and paper j. The similarities could be
derived from the ratings data, but those are already covered
by the latent factor model. Rather, we derive the similarity
of two papers by computing the cosine similarity of their
abstracts. Usually we work with the square of the cosine, which
better contrasts the higher similarities against the lower ones.
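One way to compute such a similarity is sketched below. It assumes abstracts are represented by raw term counts after lowercasing and whitespace splitting; the term weighting actually used is not specified here, so this representation is an assumption made for illustration.

```python
import math
from collections import Counter

def squared_cosine(abstract_a, abstract_b):
    """Squared cosine similarity between term-count vectors of two abstracts."""
    a = Counter(abstract_a.lower().split())
    b = Counter(abstract_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0   # an empty abstract is similar to nothing
    return (dot / (norm_a * norm_b)) ** 2

# Two-word toy abstracts sharing one word: cosine 0.5, squared 0.25.
print(squared_cosine("graph mining", "graph clustering"))
```

Squaring leaves a similarity of 1 unchanged while shrinking a cosine of 0.5 to 0.25, which is the contrast effect mentioned above.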

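This leads to a model where a reviewer's preference for
a paper is derived from his or her preferences for similar
papers, through a weighted average, as follows:

$$\hat{r}_{ui} = \mu + b_u + b_i + \gamma \, \frac{\sum_{j \in R(u)} s_{ij} \, (r_{uj} - \mu - b_u - b_j)}{\alpha + \sum_{j \in R(u)} s_{ij}} \tag{5}$$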



Here, the set R(u) contains all papers on which u bid.
The constant $\alpha$ is for regularization: it penalizes cases
where the weighted average has very low support, that is,
$\sum_{j \in R(u)} s_{ij}$ is very small (e.g., no similar paper was
rated by u). In our dataset it was determined by cross
validation to be 0.001. The parameter $\gamma$ sets the overall
weight of the paper-paper component. It is learnt as part of the
optimization process (cross-validation could have been used as
well). Its final value is close to 0.7. Overall, the resulting
average test RMSE of this model is 0.6109, which is better than
what the other models achieved so far.
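The paper-paper component can be sketched as a regularized weighted average. The version below assumes that the average is taken over baseline residuals of the reviewer's other bids and that the target paper itself is skipped; both are assumptions made for this illustration, as are the function and argument names.

```python
def paper_paper_term(u, i, bids, sim, baseline, gamma=0.7, alpha=0.001):
    """gamma * similarity-weighted average of u's residuals on other papers."""
    num = den = 0.0
    for (v, j), r_vj in bids.items():
        if v != u or j == i:
            continue                     # only u's bids on papers other than i
        s_ij = sim.get((i, j), 0.0)
        num += s_ij * (r_vj - baseline(v, j))
        den += s_ij
    # alpha keeps the ratio small when u rated no paper similar to i.
    return gamma * num / (alpha + den)

# Hypothetical data: reviewer "u" bid High on p1, and p3 is similar to p1 only.
bids = {("u", "p1"): 4, ("u", "p2"): 1, ("v", "p1"): 2}
sim = {("p3", "p1"): 1.0}

def baseline(u, i):
    return 2.5   # stand-in for mu + b_u + b_i

print(paper_paper_term("u", "p3", bids, sim, baseline))
print(paper_paper_term("u", "p4", bids, sim, baseline))   # no similar papers
```

With no similar papers the term vanishes and the prediction falls back to the baseline.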

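As usual, we combine the paper-paper similarities into
our overall scheme, which further drops the RMSE to 0.6038,
through the following model:

$$\hat{r}_{ui} = \mu + b_u + b_i + p_u^T q_i + \sum_c \sigma_{ic} \theta_{uc} w_c + \gamma \, \frac{\sum_{j \in R(u)} s_{ij} \, (r_{uj} - \mu - b_u - b_j)}{\alpha + \sum_{j \in R(u)} s_{ij}} \tag{6}$$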



3.6 Reviewer-reviewer similarities 

In analogy to paper-paper similarities, one can also use 
reviewer-reviewer similarities, in order to borrow preferences 
between like minded reviewers. This is reminiscent of clas- 
sic user-user collaborative filtering. Once again, we do not 
want to derive user-user relations directly from their prefer- 
ences, as the signal from there is already incorporated into 
the latent factor model. Instead, we resort to an additional 
data source for deriving those similarities. Here, one can 
use the publication histories of the reviewers. To model 
reviewer-reviewer similarities, we utilize the number of
commonly co-authored papers as reported in DBLP, denoted by
$s_{uv}$. (More sophisticated choices are of course open for
future exploration.) In parallel to the paper-paper model, a
preference can be predicted by following the rule:

$$\hat{r}_{ui} = \mu + b_u + b_i + \phi \, \frac{\sum_{v \in R(i)} s_{uv} \, (r_{vi} - \mu - b_v - b_i)}{\beta + \sum_{v \in R(i)} s_{uv}} \tag{7}$$



Here, the set R(i) contains all reviewers that rated i.
The regularizing constant $\beta$ penalizes cases where the
weighted average has very low support, that is,
$\sum_{v \in R(i)} s_{uv}$ is very small (e.g., no similar reviewer
has rated i). It was determined by cross validation to be
0.001. The parameter $\phi$ sets the overall weight of the
reviewer-reviewer component. It is learnt as part of the
optimization process, with final value close to 0.06 for our
dataset. (Notice that $\phi$ is much smaller than the analogous
weight of the paper-paper component, $\gamma = 0.7$.) Overall, the
resulting RMSE of this model is 0.6262, thus offering less
accuracy than its dual, the paper-paper model. In other
settings, where higher quality reviewer-reviewer similarities
are available, the relative merit of the model may increase.

3.7 Putting it all together 

The overall model benefits from integrating into it the 
reviewer-reviewer component by the combined rule: 



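$$\hat{r}_{ui} = \mu + b_u + b_i + p_u^T q_i + \sum_c \sigma_{ic} \theta_{uc} w_c + \gamma \, \frac{\sum_{j \in R(u)} s_{ij} \, (r_{uj} - \mu - b_u - b_j)}{\alpha + \sum_{j \in R(u)} s_{ij}} + \phi \, \frac{\sum_{v \in R(i)} s_{uv} \, (r_{vi} - \mu - b_v - b_i)}{\beta + \sum_{v \in R(i)} s_{uv}} \tag{8}$$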



Table 1: Subject categories used for associating reviewers and papers. Categories are ranked by their weights, 
which indicate the ability of each category to match papers to appropriate reviewers, as learnt by our model. 
For comparison the number of papers (assigned to the topic) and reviewers (claiming expertise in the topic) 
are also shown. 
All parameters are learnt simultaneously by minimizing
the associated squared error on the train set. This is our 
final prediction rule, which delivers an average test RMSE of 
0.6015. In the following section, we will show how filling up 
the unknown preferences using this model provides flexibility 
that enables deriving better paper assignments. 

4. OPTIMIZING PAPER ASSIGNMENT 

Our predicted preference matrix is now suitable for use
with any of the optimization algorithms in Section 2.2.
Denoting the output of our preference modeling as the affinity
matrix P, the assignment problem can be formulated as
motivated in Taylor [25]:

$$
\begin{aligned}
& \arg\max_R \; \mathrm{trace}(P^T R) = \arg\max_R \sum_{u,j} P_{uj} R_{uj}, \\
& \text{where } R_{uj} \in [0, 1] \quad \forall u, j, \\
& \text{and } \sum_u R_{uj} \le c_p \quad \forall j, \\
& \text{and } \sum_j R_{uj} \le c_r \quad \forall u.
\end{aligned}
\tag{9}
$$

Here $c_p$ represents the desired number of reviews per paper,
and $c_r$ is the desired maximum number of reviews per reviewer.
The third and fourth lines in the equation above represent
the constraints on the number of assignments for individual
papers and to individual reviewers, respectively. Then the
expression trace($P^T R$) represents the global sum of affinity,
or happiness of all reviewers across all assigned papers. In
particular, by using the (binary) assignments matrix R as a
factor, only the affinities from P for reviewer-paper
combinations that exist in the final assignments R are counted
in the sum.

This integer programming problem (9) is reformulated
into an easier-to-manage linear programming problem by a
series of steps, using the node-edge adjacency matrix, where
every row corresponds to a node in R, and every column
represents an edge [25]. This reformulation is a bit more
complicated, but essentially maps the problem into the
domain of linear programming, where it is solvable via methods
such as Simplex or interior point methods. In particular,
as Taylor shows in [25], because the reformulated
constraint matrix is totally unimodular, there exists at least
one globally optimal solution (assignment set) with integral
(and due to the constraints, Boolean) coefficients.
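For small instances the same objective can be solved without an LP solver by using the network-flow view from Section 2.2. The sketch below swaps in a successive-shortest-path min-cost flow (Bellman-Ford on the residual graph) in place of Taylor's LP reformulation; it is an illustrative substitute, and `assign_papers` and its data layout are our own naming.

```python
import math

def assign_papers(P, c_p, c_r):
    """Maximize total affinity with at most c_p reviewers per paper and
    at most c_r papers per reviewer. P maps reviewer -> {paper: affinity}."""
    reviewers = sorted(P)
    papers = sorted({i for row in P.values() for i in row})
    source, sink = 0, 1
    r_node = {u: 2 + k for k, u in enumerate(reviewers)}
    p_node = {i: 2 + len(reviewers) + k for k, i in enumerate(papers)}
    n = 2 + len(reviewers) + len(papers)

    edges = []                      # each entry: [head, residual capacity, cost]
    out = [[] for _ in range(n)]    # out[a] lists indices of edges leaving a

    def add_edge(a, b, cap, cost):
        out[a].append(len(edges)); edges.append([b, cap, cost])
        out[b].append(len(edges)); edges.append([a, 0, -cost])   # reverse edge

    for u in reviewers:
        add_edge(source, r_node[u], c_r, 0.0)
    for i in papers:
        add_edge(p_node[i], sink, c_p, 0.0)
    pair_edge = {}
    for u in reviewers:
        for i, affinity in P[u].items():
            pair_edge[(u, i)] = len(edges)
            add_edge(r_node[u], p_node[i], 1, -affinity)   # cost = -affinity

    while True:
        # Bellman-Ford: cheapest residual path from source to sink.
        dist = [math.inf] * n
        dist[source] = 0.0
        via = [None] * n
        for _ in range(n - 1):
            changed = False
            for a in range(n):
                if dist[a] == math.inf:
                    continue
                for k in out[a]:
                    b, cap, cost = edges[k]
                    if cap > 0 and dist[a] + cost < dist[b] - 1e-12:
                        dist[b], via[b], changed = dist[a] + cost, k, True
            if not changed:
                break
        if dist[sink] >= -1e-9:     # no path left, or none that adds affinity
            break
        node = sink
        while node != source:       # push one unit of flow along the path
            k = via[node]
            edges[k][1] -= 1
            edges[k ^ 1][1] += 1
            node = edges[k ^ 1][0]

    return {pair for pair, k in pair_edge.items() if edges[k][1] == 0}

# Hypothetical affinities: both reviewers prefer p1, but only one can have it.
P = {"u1": {"p1": 4.0, "p2": 3.0}, "u2": {"p1": 4.0, "p2": 1.0}}
print(assign_papers(P, c_p=1, c_r=1))
```

In the toy instance a greedy choice would give p1 to u1 and leave u2 with a low-affinity paper (total 5), whereas the global optimum gives p1 to u2 and p2 to u1 (total 7). This interaction between reviewers is why assignments are optimized jointly.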

5. EXPERIMENTAL RESULTS 

We have already demonstrated the ability of our modeling 
to better capture reviewer-paper preferences. But do the im- 
proved models lead to better assignments? In other words, 
does the assignment algorithm leverage the improved mod- 
eling of preferences in ways that improve end-assignment 
quality? The key distinction is between preferences ver- 
sus assignments, an aspect that has not been emphasized 
in prior recommender systems research. 

We study these issues in the context of the IEEE ICDM'07
conference data as described earlier. Data from real
conferences is quite rare to come by (as acknowledged also
in [16]), and in the future we hope that more datasets will
become available to boost recommender systems research in
conference management.

The primary questions we seek to investigate are: 

1. Do our preference models lead to improved topical rel- 
evance of assignments? 

2. Do our preference models lead to higher quality assign- 
ments? 

We use our preference model (8) to predict ratings for
potential assignments for which no expressed preferences
exist. Before doing assignments using Taylor's model (9), it
is important to balance the rating scale of various review- 
ers. For example, some reviewers are very cooperative and 
tend to give mostly high ratings, while others are more cau- 
tious and give medium to low ratings. Taylor's model may 
concentrate only on reviewers with high ratings, which is 
undesirable. Thus, we suggest two alternative per-reviewer 
normalization strategies: 

1. Subtract the per-reviewer mean from each predicted 
rating to find the residual rating for each potential as- 
signment combination. (Henceforth dubbed as Resid.) 

2. Calculate normalized ratings for each reviewer, so 
that the sum of each reviewer's predicted ratings is 1. 
(Henceforth dubbed as Norm.) 

Regardless of the chosen normalization scheme, we add the 
normalized predicted rating to the original preferences;
unknown values in the original preference matrix are set to
the midpoint of the rating scale (2.5), which places them between
the 'OK' and 'Low' ratings. This forms our final input matrix
P, which we feed into Taylor's optimization algorithm. 
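The two normalization schemes and the construction of P can be sketched as follows. The nested-dictionary layout and function names are assumptions for this illustration; the value 2.5 for unknown bids is the one stated above.

```python
def resid(predicted):
    """Resid: subtract each reviewer's mean predicted rating."""
    out = {}
    for u, row in predicted.items():
        mean = sum(row.values()) / len(row)
        out[u] = {i: r - mean for i, r in row.items()}
    return out

def norm(predicted):
    """Norm: scale each reviewer's predicted ratings so that they sum to 1."""
    out = {}
    for u, row in predicted.items():
        total = sum(row.values())
        out[u] = {i: r / total for i, r in row.items()}
    return out

def affinity_matrix(bids, normalized, unknown=2.5):
    """P[u][i] = original bid (or `unknown` if none) + normalized prediction."""
    return {u: {i: bids.get((u, i), unknown) + x for i, x in row.items()}
            for u, row in normalized.items()}

# Hypothetical predictions for one reviewer who only bid on p1.
predicted = {"u1": {"p1": 3.0, "p2": 1.0}}
bids = {("u1", "p1"): 4}
print(affinity_matrix(bids, resid(predicted)))
print(affinity_matrix(bids, norm(predicted)))
```

Either scheme removes the advantage a uniformly generous reviewer would otherwise have in the optimization, because only the reviewer's relative preferences are added to the bids.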

5.1 Topical relevance 

To assess the topical relevance of the assignments, we 
evaluate them in terms of the mappings between papers/re- 
viewers and subject categories. For every (paper,reviewer) 
assignment, we compute the dot product of the category vec- 
tor of the paper with the category vector of the reviewer, and 
sum these dot products over the assignments made. Specifi- 
cally paper-subject scores are recorded on a 2/1/0 scale (pri- 
mary versus secondary versus neither) and reviewer-subject 
scores are recorded on a 1/-1/0 scale (interest versus con- 
flict versus neither). In our dataset here, every paper has 
exactly one primary and one secondary category and hence 
the dot product can yield a number between -3 (reviewer 
has a conflict with both primary and secondary paper cat- 
egories) and 3 (reviewer has interest in both paper cate- 
gories). While other topical measures are certainly possi- 
ble, the dot product method captures the relevance or 'on- 
topicness' of assignments made to each reviewer. We used a 
90% training / 10% test set split to learn our Norm and Resid
models, and calculated the mean of the predicted ratings for 
each reviewer-paper pair across 100 iterations. 
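The topical relevance measure can be sketched as a sum of per-assignment dot products. The dictionaries below are a hypothetical encoding of the 2/1/0 paper scale and the 1/-1/0 reviewer scale, with categories scored 0 simply omitted.

```python
def topical_relevance(assignments, paper_scores, reviewer_scores):
    """Sum over assigned (reviewer, paper) pairs of the category dot product."""
    total = 0
    for reviewer, paper in assignments:
        for category, p_score in paper_scores[paper].items():
            total += p_score * reviewer_scores[reviewer].get(category, 0)
    return total

# Hypothetical paper with one primary (2) and one secondary (1) category.
paper_scores = {"p": {"clustering": 2, "text mining": 1}}
reviewer_scores = {
    "a": {"clustering": 1, "text mining": 1},     # interested in both
    "b": {"clustering": -1, "text mining": -1},   # conflict with both
    "c": {"text mining": 1},                      # secondary category only
}
for r in ("a", "b", "c"):
    print(r, topical_relevance([(r, "p")], paper_scores, reviewer_scores))
```

The three reviewers illustrate the extremes of the scale (3 and -3) and an intermediate case (1).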

Fig. 2 depicts the results in terms of percentage improvement
over the baseline Taylor approach (i.e., where only the
original preferences without any additional data were input
to the LP). Note that the topical evaluation metric shows
a measurable improvement using our modified ratings P as
input to Taylor's linear program. Since our new models
take topical relevance into account, this is not unexpected.
However, we accomplished this topical optimization without
degrading the Taylor algorithm's original 'rating sum'
objective; in fact, both the models considered here slightly
improve this objective as well (see Fig. 2).

Figure 2: Topical relevance of assignments made
with our approach versus Taylor's original formulation.



5.2 Assignment Quality 

The common train-test split methodology, which was used
in Section 3, is also useful for assessing assignment quality.
Neither the prediction algorithm nor the assignment algorithm
can see the given preferences within the test set. Clearly,
the elimination of the test set's preferences limits the
flexibility of the assignment algorithm, as it has a lower number
of favorable preferences from which to choose. However, the
prediction model fills this gap by providing estimates to all
missing preferences, including those in the test set. This
simulates the real life scenario, where the given reviewer
ratings (corresponding to the training set) are limiting the
possibilities of the assignment algorithm, but by revealing more
ratings to the algorithms (including the test set) they gain
the flexibility to provide better assignments.

As the proportion of the test set increases, we take away 
more available preferences, which simulates an increasingly 
harsher assignment environment. Accordingly, we evaluated 
several possible proportions, ranging from 50% of the given 
preferences within the test set, to 30% of preferences in 
the test set. In each experiment, we employed a series of 
20 random train-test splits and evaluated assignment
quality. The baseline is Taylor's original algorithm, where all
missing ratings, including those in the test set, are treated 
as "unknowns." We compare this baseline against the two 
aforementioned alternatives, Resid and Norm. 

We evaluate quality of assignments by their ability to
make good use of the hidden ratings in the test set. The
results are presented in Figs. 3, 4, and 5 and were fairly
consistent over the different proportions of the test set. As
illustrated here, the predominant number (around 60-70%)
of test assignments made using the original preference
matrix fall in the unpreferred ("No") category. On the other
hand, when imputing the missing ratings, using either Resid
or Norm, the balance completely changes in favor of higher
quality preferences. Resid makes about 50-60% of test
assignments out of the highest quality ratings ("High"), and
only about 15% of test assignments are bad ("No"). Norm is
close, but not quite as good as Resid, a difference that should
be further investigated over additional datasets. Overall we
find the results strongly support our goal to increase
assignment quality by providing more flexibility with additional
ratings from which to choose.

Figure 3: Evaluating the assignments made by the
unmodified Taylor algorithm and the new preference
models w.r.t. reviewers' four categories of preferences,
using a 70-30 test-training set split, averaged
across 20 iterations. Mean assignments per iteration,
and each value's percent of assignments for
each iteration, are indicated above each bar.



6. DISCUSSION 

We have investigated the modeling of paper-reviewer
preferences within a conference management system. The very
limited data, typical of this context, requires identifying
and exploiting multiple sources of information within a
hybrid recommendation model. The proposed models provide
improved predictions of reviewer preferences. More importantly,
we showed how the improved modeling of such preferences
can lead to improvements in actual review assignments.
Encouraging experimental results demonstrate that
the improved modeling can be well worth the effort in ensuring
the satisfaction of conference program committee reviewers.
A key question for future work is to provide theoretical
justification for the empirical evidence presented here. We also
intend to deploy the recommendation capabilities presented
here in a real conference management system and gain further
insights into the issues involved.

Figure 4: Evaluating the assignments made by the
unmodified Taylor algorithm and the new preference
models, using a 60-40 test-training set split.

Figure 5: Evaluating the assignments made by the
unmodified Taylor algorithm and the new preference
models, using a 50-50 test-training set split.



7. REFERENCES 

[1] C. Basu, H. Hirsh, W. Cohen, and C. Nevill-Manning.
Technical paper recommendation: a study in
combining multiple information sources. Journal of AI
Research, pages 231-252, 2001.
[2] S. Benferhat. Conference paper assignment.
International Journal of Intelligent Systems,
16(10):1183, 2001.
[3] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet
allocation. Journal of Machine Learning Research,
3:993-1022, 2003.
[4] J. Canny. Collaborative filtering with privacy via
factor analysis. In Proc. SIGIR'02, pages 238-245,
2002.
[5] S. T. Dumais and J. Nielsen. Automating the
assignment of submitted manuscripts to reviewers. In
Proc. SIGIR'92, pages 233-244, 1992.
[6] D. Goldberg, D. Nichols, B. Oki, and D. Terry. Using
collaborative filtering to weave an information
tapestry. Commun. of the ACM, 35:61-70, 1992.
[7] J. Goldsmith and R. H. Sloan. The AI conference
paper assignment problem. In Pref. Handling for AI,
Papers from the AAAI Workshop, 2007.
[8] D. Hartvigsen, J. C. Wei, and R. Czuchlewski. The
conference paper-reviewer assignment problem.
Decision Sciences, 30(3):865-876, 1999.
[9] T. Hofmann. Latent semantic models for collaborative
filtering. ACM Transactions on Info. Systems,
22:89-115, 2004.
[10] J. E. Hopcroft and R. M. Karp. An n^{5/2} algorithm for
maximum matching in bipartite graphs. SIAM
Journal on Computing, 18:225-231, 1973.
[11] J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker,
L. R. Gordon, and J. Riedl. GroupLens: applying
collaborative filtering to Usenet news. Commun. of the
ACM, 40(3):77-87, 1997.
[12] Y. Koren. Factorization meets the neighborhood: a
multifaceted collaborative filtering model. In Proc.
KDD'08, pages 426-434, 2008.
[13] H. W. Kuhn. The Hungarian method for the
assignment problem. Naval Research Logistics
Quarterly, 2:83-97, 1955.
[14] N. D. Mauro, T. M. A. Basile, and S. Ferilli. GRAPE:
an expert review assignment component for scientific
conference management systems. In Proc.
IEA/AIE'2005, pages 789-798, 2005.
[15] S. McNee, J. Riedl, and J. Konstan. Being accurate is
not enough: how accuracy metrics have hurt
recommender systems. In CHI Extended Abstracts,
pages 1097-1101, 2006.
[16] D. Mimno and A. McCallum. Expertise modeling for
matching papers with reviewers. In Proc. KDD'07,
pages 500-509, 2007.
[17] A. Paterek. Improving regularized singular value
decomposition for collaborative filtering. In Proc.
KDD Cup and Workshop, 2007.
[18] R. Popescul, L. H. Ungar, D. M. Pennock, and
S. Lawrence. Probabilistic models for unified
collaborative and content-based recommendation in
sparse-data environments. In Proc. of the 17th Conf.
on Uncertainty in AI, pages 437-444, 2001.
[19] N. Ramakrishnan, O. Zaïane, Y. Shi, C. Clifton, and
X. Wu. Proc. ICDM'07, 2007.
[20] P. Rigaux. An iterative rating method: application to
web-based conference management. In Proc. SAC'04,
pages 1682-1687, 2004.
[21] R. Salakhutdinov and A. Mnih. Probabilistic matrix
factorization. In Proc. NIPS'07, pages 1257-1264,
2008.
[22] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted
Boltzmann machines for collaborative filtering. In
Proc. 24th Annual Intl. Conf. on Machine Learning,
pages 791-798, 2007.
[23] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl.
Item-based collaborative filtering recommendation
algorithms. In Proc. 10th Intl. Conf. on the World
Wide Web, pages 285-295, 2001.
[24] G. Takacs, I. Pilaszy, B. Nemeth, and D. Tikk. Major
components of the gravity recommendation system.
SIGKDD Explorations, 9:80-84, 2007.
[25] C. J. Taylor. On the optimal assignment of conference
papers to reviewers. Technical Report MS-CIS-08-30,
University of Pennsylvania, 2008.
[26] X. Wei and W. B. Croft. LDA-based document models
for ad-hoc retrieval. In Proc. SIGIR'06, pages
178-185, 2006.
[27] D. Yarowsky and R. Florian. Taking the load off the
conference chairs: towards a digital paper-routing
assistant. In Proc. EMNLP'99, 1999.



